Building a song recommender

Fire up GraphLab Create


In [2]:
import graphlab

Load music data


In [3]:
song_data = graphlab.SFrame('song_data.gl/')

Explore data

Music data shows how many times a user listened to a song, as well as the details of the song.


In [4]:
song_data.head(5)


Out[4]:
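+----------------------------------------------+--------------------+--------------+-----------------+---------------+
| user_id                                      | song_id            | listen_count | title           | artist        |
+----------------------------------------------+--------------------+--------------+-----------------+---------------+
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOAKIMP12A8C130995 | 1            | The Cove        | Jack Johnson  |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBBMDR12A8C13253B | 2            | Entre Dos Aguas | Paco De Lucia |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBXHDL12A81C204C0 | 1            | Stronger        | Kanye West    |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SOBYHAJ12A6701BF1D | 1            | Constellations  | Jack Johnson  |
| b80344d063b5ccb3212f76538f3d9e43d87dca9e ... | SODACBL12A8C13C273 | 1            | Learn To Fly    | Foo Fighters  |
+----------------------------------------------+--------------------+--------------+-----------------+---------------+
+-------------------------------------+
| song                                |
+-------------------------------------+
| The Cove - Jack Johnson             |
| Entre Dos Aguas - Paco De Lucia ... |
| Stronger - Kanye West               |
| Constellations - Jack Johnson ...   |
| Learn To Fly - Foo Fighters ...     |
+-------------------------------------+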
[5 rows x 6 columns]


In [5]:
graphlab.canvas.set_target('ipynb')

In [6]:
song_data['song'].show()



In [7]:
len(song_data)


Out[7]:
1116609

Count number of unique users in the dataset


In [8]:
users = song_data['user_id'].unique()

In [10]:
len(users)


Out[10]:
66346

Create a song recommender

First, split the dataset into training and test sets.


In [11]:
train_data,test_data = song_data.random_split(.8,seed=0)
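
The training log below reports 893580 of the 1116609 rows (about 80%) and 66085 of the 66346 users, so a row-wise split leaves a few users with no training data. As a rough illustration of what such a split does, here is a minimal pure-Python sketch. It is not GraphLab's implementation: the helper `split_rows` and the toy rows are hypothetical, and we assume each row is sent to the first part independently with the given probability, which yields an approximate rather than exact 80/20 split.

```python
import random

def split_rows(rows, fraction, seed=0):
    """Send each row to the first part with probability `fraction`."""
    rng = random.Random(seed)  # seeded, so the split is reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

# Hypothetical toy rows: (user_id, song)
rows = [("u%d" % (i % 50), "song%d" % i) for i in range(1000)]
train_rows, test_rows = split_rows(rows, 0.8, seed=0)

print(len(train_rows), len(test_rows))
```

Reusing the same seed reproduces the same split, which is why the notebook fixes `seed=0`.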

Simple popularity-based recommender


In [12]:
popularity_model = graphlab.popularity_recommender.create(train_data,
                                                         user_id='user_id',
                                                         item_id='song')


PROGRESS: Recsys training: model = popularity
PROGRESS: Warning: Ignoring columns song_id, listen_count, title, artist;
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 15.6384s
PROGRESS: 893580 observations to process; with 9952 unique items.

Use the popularity model to make some predictions

A popularity model ranks songs identically for every user, so it provides no personalization.


In [13]:
popularity_model.recommend(users=[users[0]])


Out[13]:
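+----------------------------------------------+-----------------------------------------------------+--------+------+
| user_id                                      | song                                                | score  | rank |
+----------------------------------------------+-----------------------------------------------------+--------+------+
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Sehr kosmisch - Harmonia                            | 4754.0 | 1    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Undo - Björk                                        | 4227.0 | 2    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | You're The One - Dwight Yoakam ...                  | 3781.0 | 3    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Dog Days Are Over (Radio Edit) - Florence + The ... | 3633.0 | 4    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Revelry - Kings Of Leon                             | 3527.0 | 5    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Horn Concerto No. 4 in E flat K495: II. Romance ... | 3161.0 | 6    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Secrets - OneRepublic                               | 3148.0 | 7    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Fireflies - Charttraxx Karaoke ...                  | 2532.0 | 8    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Tive Sim - Cartola                                  | 2521.0 | 9    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Drop The World - Lil Wayne / Eminem ...             | 2053.0 | 10   |
+----------------------------------------------+-----------------------------------------------------+--------+------+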
[10 rows x 4 columns]


In [14]:
popularity_model.recommend(users=[users[1]])


Out[14]:
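+----------------------------------------------+-----------------------------------------------------+--------+------+
| user_id                                      | song                                                | score  | rank |
+----------------------------------------------+-----------------------------------------------------+--------+------+
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Sehr kosmisch - Harmonia                            | 4754.0 | 1    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Undo - Björk                                        | 4227.0 | 2    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | You're The One - Dwight Yoakam ...                  | 3781.0 | 3    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Dog Days Are Over (Radio Edit) - Florence + The ... | 3633.0 | 4    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Revelry - Kings Of Leon                             | 3527.0 | 5    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Horn Concerto No. 4 in E flat K495: II. Romance ... | 3161.0 | 6    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Secrets - OneRepublic                               | 3148.0 | 7    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Hey_ Soul Sister - Train                            | 2538.0 | 8    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Fireflies - Charttraxx Karaoke ...                  | 2532.0 | 9    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Tive Sim - Cartola                                  | 2521.0 | 10   |
+----------------------------------------------+-----------------------------------------------------+--------+------+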
[10 rows x 4 columns]

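The model recommends the same popular songs, in the same order, to both users. The lists differ only where a song is skipped for one user ("Hey_ Soul Sister - Train" is missing from the first list), presumably because recommendations exclude songs the user has already listened to.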
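
To make the idea concrete, here is a minimal pure-Python sketch of a popularity recommender. It is an illustration rather than GraphLab's implementation: we assume the score is simply the number of observations of each song and that songs a user has already played are skipped. The function `popularity_recommend` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def popularity_recommend(observations, user, k=2):
    """Rank songs by observation count, skipping songs `user` already played."""
    counts = Counter(song for _, song in observations)
    seen = defaultdict(set)  # user -> songs that user has played
    for u, song in observations:
        seen[u].add(song)
    # Most observed first; ties are broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    return [(s, counts[s]) for s in ranked if s not in seen[user]][:k]

# Hypothetical toy data: (user_id, song) pairs
observations = [
    ("u1", "A"), ("u2", "A"), ("u3", "A"),
    ("u1", "B"), ("u2", "B"),
    ("u3", "C"), ("u4", "C"),
    ("u4", "D"),
]

print(popularity_recommend(observations, "u4"))        # has played C and D
print(popularity_recommend(observations, "u3"))        # has played A and C
print(popularity_recommend(observations, "new_user"))  # no listening history
```

Every user is ranked against the same global counts; only the songs filtered out differ from user to user.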

Build a song recommender with personalization

We now create a model that allows us to make personalized recommendations to each user.


In [15]:
personalized_model = graphlab.item_similarity_recommender.create(train_data,
                                                                user_id='user_id',
                                                                item_id='song')


PROGRESS: Recsys training: model = item_similarity
PROGRESS: Warning: Ignoring columns song_id, listen_count, title, artist;
PROGRESS:     To use one of these as a target column, set target = <column_name>
PROGRESS:     and use a method that allows the use of a target.
PROGRESS: Preparing data set.
PROGRESS:     Data has 893580 observations with 66085 users and 9952 items.
PROGRESS:     Data prepared in: 13.4025s
PROGRESS: Computing item similarity statistics:
PROGRESS: Computing most similar items for 9952 items:
PROGRESS: +-----------------+-----------------+
PROGRESS: | Number of items | Elapsed Time    |
PROGRESS: +-----------------+-----------------+
PROGRESS: | 1000            | 14.3088         |
PROGRESS: | 2000            | 15.5235         |
PROGRESS: | 3000            | 16.6664         |
PROGRESS: | 4000            | 17.8324         |
PROGRESS: | 5000            | 19.0154         |
PROGRESS: | 6000            | 20.0929         |
PROGRESS: | 7000            | 21.0519         |
PROGRESS: | 8000            | 22.3922         |
PROGRESS: | 9000            | 23.9032         |
PROGRESS: +-----------------+-----------------+
PROGRESS: Finished training in 28.5542s

Applying the personalized model to make song recommendations

As you can see, different users get different recommendations now.


In [16]:
personalized_model.recommend(users=[users[0]])


Out[16]:
+----------------------------------------------+----------------------------------------------+-----------------+------+
| user_id                                      | song                                         | score           | rank |
+----------------------------------------------+----------------------------------------------+-----------------+------+
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Cuando Pase El Temblor - Soda Stereo ...     | 0.0194504525792 | 1    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Fireflies - Charttraxx Karaoke ...           | 0.014473730789  | 2    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Love Is A Losing Game - Amy Winehouse ...    | 0.0142865986808 | 3    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Marry Me - Train                             | 0.0141334715267 | 4    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Secrets - OneRepublic                        | 0.0135916683588 | 5    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | No Dejes Que... - Caifanes ...               | 0.0134191754754 | 6    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Sehr kosmisch - Harmonia                     | 0.0133987908186 | 7    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Y solo se me ocurre amarte (Unplugged) - ... | 0.0133210385369 | 8    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | Te Hacen Falta Vitaminas - Soda Stereo ...   | 0.0129302853556 | 9    |
| c66c10a9567f0d82ff31441a9fd5063e5cd9dfe8 ... | OMG - Usher featuring will.i.am ...          | 0.0127778293142 | 10   |
+----------------------------------------------+----------------------------------------------+-----------------+------+
[10 rows x 4 columns]


In [17]:
personalized_model.recommend(users=[users[1]])


Out[17]:
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
| user_id                                      | song                                                | score           | rank |
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Where The Boat Leaves From (Album) - Zac Brown ...  | 0.063530766032  | 1    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Different Kind Of Fine (Album) - Zac Brown Band ... | 0.0628011029296 | 2    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Jolene (Album) - Zac Brown Band ...                 | 0.0578682052943 | 3    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Sic 'Em On A Chicken (Album) - Zac Brown Band ...   | 0.0551866929279 | 4    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Who's Kissing You Tonight - Jason Aldean ...        | 0.0547525233792 | 5    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Highway 20 Ride (Album) - Zac Brown Band ...        | 0.0398751780992 | 6    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | What Country Is - Luke Bryan ...                    | 0.0374908065185 | 7    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Do I - Luke Bryan                                   | 0.0350614821658 | 8    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | One Fine Wire - Colbie Caillat ...                  | 0.03125         | 9    |
| 02f015d32ac2cd1e52d26e3ec36048711dd5711b ... | Midnight Bottle - Colbie Caillat ...                | 0.030737704918  | 10   |
+----------------------------------------------+-----------------------------------------------------+-----------------+------+
[10 rows x 4 columns]
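
One common way to turn item-to-item similarities into per-user recommendations is to score each unseen song by its average similarity to the songs the user has already played. The sketch below illustrates that idea only. We are assuming this scoring rule, and both `recommend_for_user` and the similarity values are hypothetical rather than taken from the trained model.

```python
def recommend_for_user(user_songs, similarity, k=2):
    """Score each unseen song by its mean similarity to the user's songs."""
    played = set(user_songs)
    # Candidates: songs similar to something the user played, minus those played.
    candidates = {b for (a, b) in similarity if a in played} - played
    scores = {}
    for cand in candidates:
        total = sum(similarity.get((song, cand), 0.0) for song in played)
        scores[cand] = total / len(played)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Hypothetical item-item similarities, stored in both directions
pairs = {("A", "B"): 0.5, ("A", "C"): 0.1, ("B", "C"): 0.4, ("C", "D"): 0.6}
similarity = dict(pairs)
similarity.update({(b, a): s for (a, b), s in pairs.items()})

print(recommend_for_user(["A"], similarity))  # a user who played only A
print(recommend_for_user(["D"], similarity))  # a user who played only D
```

Because the score depends on each user's own listening history, two users with different histories get different ranked lists.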

We can also apply the model to find songs similar to any song in the dataset.


In [18]:
personalized_model.get_similar_items(['With Or Without You - U2'])


PROGRESS: Getting similar items completed in 0.10229
Out[18]:
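+--------------------------+------------------------------------------------+-----------------+------+
| song                     | similar                                        | score           | rank |
+--------------------------+------------------------------------------------+-----------------+------+
| With Or Without You - U2 | I Still Haven't Found What I'm Looking For ... | 0.0428571428571 | 1    |
| With Or Without You - U2 | Hold Me_ Thrill Me_ Kiss Me_ Kill Me - U2 ...  | 0.033734939759  | 2    |
| With Or Without You - U2 | Window In The Skies - U2                       | 0.0328358208955 | 3    |
| With Or Without You - U2 | Vertigo - U2                                   | 0.0300751879699 | 4    |
| With Or Without You - U2 | Sunday Bloody Sunday - U2                      | 0.0271317829457 | 5    |
| With Or Without You - U2 | Bad - U2                                       | 0.0251798561151 | 6    |
| With Or Without You - U2 | A Day Without Me - U2                          | 0.0237154150198 | 7    |
| With Or Without You - U2 | Another Time Another Place - U2 ...            | 0.020325203252  | 8    |
| With Or Without You - U2 | Walk On - U2                                   | 0.020202020202  | 9    |
| With Or Without You - U2 | Get On Your Boots - U2                         | 0.0196850393701 | 10   |
+--------------------------+------------------------------------------------+-----------------+------+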
[10 rows x 4 columns]


In [19]:
personalized_model.get_similar_items(['Chan Chan (Live) - Buena Vista Social Club'])


PROGRESS: Getting similar items completed in 0.008413
Out[19]:
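+------------------------------------------------+-----------------------------------------------------+-----------------+------+
| song                                           | similar                                             | score           | rank |
+------------------------------------------------+-----------------------------------------------------+-----------------+------+
| Chan Chan (Live) - Buena Vista Social Club ... | Murmullo - Buena Vista Social Club ...              | 0.188118811881  | 1    |
| Chan Chan (Live) - Buena Vista Social Club ... | La Bayamesa - Buena Vista Social Club ...           | 0.187192118227  | 2    |
| Chan Chan (Live) - Buena Vista Social Club ... | Amor de Loca Juventud - Buena Vista Social Club ... | 0.184834123223  | 3    |
| Chan Chan (Live) - Buena Vista Social Club ... | Diferente - Gotan Project                           | 0.0214592274678 | 4    |
| Chan Chan (Live) - Buena Vista Social Club ... | Mistica - Orishas                                   | 0.0205761316872 | 5    |
| Chan Chan (Live) - Buena Vista Social Club ... | Hotel California - Gipsy Kings ...                  | 0.019305019305  | 6    |
| Chan Chan (Live) - Buena Vista Social Club ... | Nací Orishas - Orishas                              | 0.0191570881226 | 7    |
| Chan Chan (Live) - Buena Vista Social Club ... | Le Moulin - Yann Tiersen                            | 0.0187969924812 | 8    |
| Chan Chan (Live) - Buena Vista Social Club ... | Gitana - Willie Colon                               | 0.0187969924812 | 9    |
| Chan Chan (Live) - Buena Vista Social Club ... | Criminal - Gotan Project                            | 0.018779342723  | 10   |
+------------------------------------------------+-----------------------------------------------------+-----------------+------+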
[10 rows x 4 columns]
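
The similarity scores above are small ratios, which is what a set-overlap measure such as Jaccard similarity would produce. As a sketch under that assumption (the model's actual similarity computation is not shown in this notebook), the following pure-Python example ranks songs by the Jaccard similarity of their listener sets. The functions `jaccard` and `similar_items` and the toy data are hypothetical.

```python
from collections import defaultdict

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = len(a | b)
    return len(a & b) / float(union) if union else 0.0

def similar_items(observations, song, k=2):
    """Return the k songs whose listener sets overlap most with `song`'s."""
    listeners = defaultdict(set)  # song -> users who played it
    for user, s in observations:
        listeners[s].add(user)
    target = listeners.get(song, set())
    scores = [(other, jaccard(target, users))
              for other, users in listeners.items() if other != song]
    # Highest similarity first; ties are broken alphabetically.
    scores.sort(key=lambda kv: (-kv[1], kv[0]))
    return scores[:k]

# Hypothetical toy data: (user_id, song) pairs
observations = [
    ("u1", "X"), ("u2", "X"), ("u3", "X"),
    ("u1", "Y"), ("u2", "Y"),
    ("u3", "Z"), ("u4", "Z"),
    ("u4", "W"),
]

print(similar_items(observations, "X"))
```

Here Y shares two of X's three listeners, while Z shares one listener out of four distinct users, so Y ranks first.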

Quantitative comparison between the models

We now formally compare the popularity and the personalized models using precision-recall curves.


In [20]:
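# Compare (major, minor) numerically; a plain string comparison would misorder versions such as "1.10".
if tuple(int(p) for p in graphlab.version.split('.')[:2]) >= (1, 6):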
    model_performance = graphlab.compare(test_data, [popularity_model, personalized_model], user_sample=0.05)
    graphlab.show_comparison(model_performance,[popularity_model, personalized_model])
else:
    %matplotlib inline
    model_performance = graphlab.recommender.util.compare_models(test_data, [popularity_model, personalized_model], user_sample=.05)


compare_models: using 2931 users to estimate model performance
PROGRESS: Evaluate model M0
PROGRESS: recommendations finished on 1000/2931 queries. users per second: 231.98
PROGRESS: recommendations finished on 2000/2931 queries. users per second: 239.53

Precision and recall summary statistics by cutoff
+--------+-----------------+------------------+
| cutoff |  mean_precision |   mean_recall    |
+--------+-----------------+------------------+
|   1    | 0.0296827021494 | 0.00727050821728 |
|   2    | 0.0284885704538 | 0.0147388444983  |
|   3    | 0.0249061753668 | 0.0192294418497  |
|   4    | 0.0237973387922 | 0.0242431494734  |
|   5    | 0.0219037871034 | 0.0284013666306  |
|   6    | 0.0214375071079 | 0.0343936911491  |
|   7    | 0.0208120095531 | 0.0385803068761  |
|   8    | 0.0195325827363 | 0.0412611375508  |
|   9    | 0.0188407445316 | 0.0445268159188  |
|   10   | 0.0180143295803 | 0.0471908460394  |
+--------+-----------------+------------------+
[10 rows x 3 columns]

PROGRESS: Evaluate model M1
PROGRESS: recommendations finished on 1000/2931 queries. users per second: 83.7665
PROGRESS: recommendations finished on 2000/2931 queries. users per second: 83.8638

Precision and recall summary statistics by cutoff
+--------+-----------------+-----------------+
| cutoff |  mean_precision |   mean_recall   |
+--------+-----------------+-----------------+
|   1    |  0.17604912999  | 0.0547009702076 |
|   2    |  0.146025247356 | 0.0865755553064 |
|   3    |  0.129307403617 |  0.112642732755 |
|   4    |  0.116854315933 |  0.132175285195 |
|   5    |  0.107267144319 |  0.148674170706 |
|   6    | 0.0987148868418 |  0.164266824195 |
|   7    | 0.0924111712239 |  0.181159315115 |
|   8    | 0.0875554418287 |  0.196751423261 |
|   9    | 0.0825277683005 |  0.208081462304 |
|   10   | 0.0780620948482 |  0.217065154604 |
+--------+-----------------+-----------------+
[10 rows x 3 columns]

Model compare metric: precision_recall

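The precision-recall results show that the personalized model (M1) performs much better than the popularity model (M0) at every cutoff. At cutoff 10, for example, its mean precision is about 0.078 versus 0.018, and its mean recall is about 0.217 versus 0.047.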
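
For reference, precision and recall at a cutoff k can be computed per user as follows and then averaged over the sampled users. This is a minimal sketch of the standard definitions, not the library's evaluation code. The function `precision_recall_at_k` and the example lists are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(k)  # fraction of the k recommendations that were relevant
    recall = hits / float(len(relevant)) if relevant else 0.0  # fraction of relevant songs found
    return precision, recall

recommended = ["A", "B", "C", "D"]  # ranked recommendations for one user
relevant = {"B", "D", "E"}          # songs that user played in the test set

for k in (1, 2, 4):
    print(k, precision_recall_at_k(recommended, relevant, k))
```

As k grows, recall can only stay flat or increase, while precision tends to fall, which matches the trend in both tables above.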

